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Abstract 

State-space models provide an important body of techniques for an- 
alyzing time-series, but their use requires estimating unobserved states. 
The optimal estimate of the state is its conditional expectation given 
the observation histories, and computing this expectation is hard when 
there are nonlinearities. Existing filtering methods, including sequential 
Monte Carlo, tend to be either inaccurate or slow. In this paper, we 
study a nonlinear filter for nonlinear/non-Gaussian state-space models, 
which uses Laplace's method, an asymptotic series expansion, to approxi- 
mate the state's conditional mean and variance, together with a Gaussian 
conditional distribution. This Laplace-Gaussian filter (LGF) gives fast, 
recursive, deterministic state estimates, with an error which is set by the 
stochastic characteristics of the model and is, we show, stable over time. 
We illustrate the estimation ability of the LGF by applying it to the prob- 
lem of neural decoding and compare it to sequential Monte Carlo both in 
simulations and with real data. We find that the LGF can deliver superior 
results in a small fraction of the computing time. 

Keywords: Laplace's method, recursive Bayesian estimation, neural de- 
coding 



1 Introduction 

The central statistical problem in applying state-space models is that of filter- 
ing, i.e., estimating the unobserved state from the observations. For nonlinear 
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or non-Gaussian models, considerable effort has been devoted to devising ap- 
proximate solutions to the filtering problem, based mainly on simulation meth- 
ods such as particle filtering and its variants (Kitagawa 1987; Kitagawa 1996; 
Doucet, de Freitas and Gordon 2001). In this article we study a deterministic 
approximation based on sequential application of Laplace's method which we 
call the Laplace Gaussian filter (LGF), and we illustrate the approach in the 
context of real-time neural decoding (Brockwell, Schwartz and Kass 2007; Eden, 
Frank, Barbieri, Solo and Brown 2004; Serruya, Hatsopoulos, Paninski, Fellows 
and Donoghue 2002). In this context we find the LGF to be far more accurate, 
for equivalent computational cost, than particle filtering. 

Suppose we have a stochastic state process {xt}, t = 1,2,... and a related 
observation process {yt}- Filtering consists of estimating the state x t given a 
sequence of observations yi,y2, ■■■yt = Vv.u i- e -> finding the posterior distribu- 
tion p(xt\yi:t) of the state, given the sequence. It is common to assume that 
the state Xt is a first-order homogeneous Markov process, with initial density 
p{xi) and transition density p(xt+i\xt), and that yt is independent of states 
and observations at all other times given xt, with observation density p(y t \xt). 
Baycs's Rule gives a recursive filtering formula, 



is the predictive distribution, which convolves the previous filtered distribution 
with the transition density. In principle, Equations ((T|) and ([2]) give a complete, 
recursive solution to the filtering problem for state-space models: the mean- 
squared optimal point estimate is simply the mean of the posterior density 
given by Equation (TTJ). When the dynamics are nonlinear, non-Gaussian, or 
even just high-dimensional, however, computing these estimates sequentially 
can be a major challenge. 

One approach to Baycsian computation is to attempt to simulate from the 
posterior distribution. Applying Monte Carlo simulation to Equations {I])-© 
would let us draw from p(x t \yi-.t), if we had p(xt\yi-.t-i) ■ The insight of parti- 
cle filtering is that the latter distribution can itself be approximated by Monte 
Carlo simulation (Kitagawa 1996; Doucet et al. 2001). This turns the recursive 
equations for the filtering distribution into a stochastic dynamical system of 
interacting particles (Del Moral and Miclo 2000), each representing one draw 
from that posterior. While particle filtering has proven itself to be useful in 
practice (Doucet et al. 2001; Brockwell, Rojas and Kass 2004; Ergiin, Barbieri, 
Eden, Wilson and Brown 2007), like any Monte Carlo scheme it can be com- 
putationally costly; moreover, the number of particles (and so the amount of 
computation) needed for a given accuracy grows rapidly with the dimensional- 
ity of the state-space. For real-time processing, such as neural decoding, the 
computational cost of effective particle filtering can quickly become prohibitive. 



P(xt\yi-.t) 



p(yt\x t )p(x t \yi:t-i) 



(1) 



/ p(y t \xt)p(xt\yi:t-i)dxt' 



where 
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The primary difficulty with the nonlinear filtering equations comes from 
their integrals. We use Laplace's method to obtain estimates of the mean and 
variance of the posterior density in Eq. (|T|), and then approximate that density 
by a Gaussian with that mean and variance. This distribution is then recursively 
updated in its turn when the next observation is taken. 

There are several versions of Laplace's method, all of which replace integrals 
with series expansion around the maxima of integrands. An expansion param- 
eter, 7, measures the concentration of the integrand about its peak. In the 
simplest version, the posterior distribution is replaced by a Gaussian centered 
at the posterior mode. Under mild regularity conditions, this gives a first-order 
approximation of posterior expectations, with error of order 0(7 _1 ). Several 
papers have applied some form of first-order Laplace approximation sequentially 
(Brown, Frank, Tang, Quirk and Wilson 1998; Eden et al. 2004). In the ordi- 
nary static context, Tierney, Kass and Kadanc (1989) analyzed the way a refined 
procedure, the "fully exponential" Laplace approximation, gives a second-order 
approximation for posterior expectations, having an error of order (9(7~ 2 ). In 
this paper we provide theoretical results justifying these approximations in the 
sequential context. Because state estimation proceeds recursively over time, it is 
conceivable that the approximation error could accumulate, which would make 
the approach ineffective. Our results show that, under reasonable regularity 
conditions, this does not happen: the posterior mean from the LGF approxi- 
mates the true posterior mean with error 0(7"") uniformly across time, where 
a = 1 or 2 depending on the order of the LGF. 

We specify the LGF in Section [21 and give our theoretical results in Section 
[H Section [4] introduces the neural decoding problem and reports comparative 
results both in simulation studies and with real data. We provide additional 
comments in Section [S] Proofs and implementation details are collected in the 
appendix^ 

2 The Laplace-Gaussian filter (LGF) 

Throughout the paper, x^u and v t u denote the mode and variance of the true 
filtered distribution at time t given a sequence of observations yi-.t, and simi- 
larly x t u_i and t>t|t_i are those of the predictive distribution at time t given 
Ui-.t-ii respectively. Hats " and tildes" on variables indicate approximations; in 
particular, x denotes the approximated posterior mode, and x the approximated 
posterior mean. The transpose of a matrix A is written A T . Bold type of a 
small letter indicates a column vector. 

2.1 Algorithm 

The LGF procedure for a one-dimensional state is as follows. (The multi- 
dimensional extension is straightforward; see below.) 

1 Appendices B-E appeared as a supplementary file in the journal version. 
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1. At time t = 1, initialize the predictive distribution of the state, p{x\). 

2. Observe y*. 

3. (Filtering) Obtain the approximate posterior mean x t \t and variance v t \t 
by Laplace's method (see below), and set p(xt\yi-t) to be a Gaussian dis- 
tribution with the same mean and variance. 

4. (Prediction) Calculate the predictive distribution, 

p(x t+ x\yx;t) = p(x t +i\xt)p(xt\yx : t)dx t . (3) 



5. Increment t and go to step 2. 

We consider first- and second- order Laplace's approximations. In the first- 
order Laplace approximation the posterior mean and variance are x t \t = Xt\t = 
argmax a , t l(x t ) and v t \t = H"(£t|t)]~\ where l(x t ) = logp(y t \x t )p(x t \yx:t-i)- 

The second-order (fully exponential) Laplace approximation is calculated as 
follows (Tierney et al. 1989). For a given positive function g of the state, let 
k(xt) = log g(xt)p(yt\xt)p(xt\yi:t-x), and let x t u maximize k. The posterior 
expectation of g for the second order approximation is then 

/ m i \- k"(x tlt )\-i exp[k{x t \ t )} 

E[g(xt)\yx:t] « - — i ■ (4) 

|-Z"(a; t | t )| 2 exp[Z(x t | t )] 

When the 3 we care about is not necessarily positive, a simple and practical 
trick is to add a large constant c to g so that g(x) + c > 0, apply Eq. and 
then subtract c. The posterior mean is thus calculated as x t \ t = E[xt + c] — c. 
(In practice it suffices that the probability of the event {g{xt) + c > 0} is 
close to one under the true distribution of xt- Allowing this to be merely very 
probable rather than almost sure introduces additional approximation error, 
which however can be made arbitrarily small simply by increasing the constant 
c. See Tierney et al. (1989) for details.) The posterior variance is set to be 
Vt\t = [— 1 1 )] 1 , as this suffices for second-order accuracy (see Remark[T]in 
Appendix |A")) . 

To use the LGF to estimate a state in d-dimensional space, one simply 
takes the function g to be each coordinate in turn, i.e., g(x t ) = Xt t i + c for 
each i = 1,2, . . .,d. Each g is a function of M d — > K, and | — l"(x t u)\~? and 
I — fc"(x t | t )| _ 2 in Eq. (j4} are replaced by the determinants of the Hessians of 
l(x t u) and k(x t u), respectively. Thus, estimating the state with the second- 
order LGF takes d times as long as using the first-order LGF, since posterior 
means of each component of x t must be calculated separately. 

In many applications the state process is taken to be a linear Gaussian 
process (such as an autorcgrcssion or random walk) so that the integral in 
Eq. © is analytic. When this integral is not done analytically, cither the 
asymptotic expansion (|23[) or a numerical method may be employed. To apply 
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our theoretical results, the numerical error in the integration must be 0(j~ a ), 
where 7 is the expansion parameter, to be discussed in section [37T1 and a = 1 
or 2 depending on the order of the LGF. 



2.2 Smoothing 

The LGF can also be used for smoothing. That is, given the observation up to 
time T, yi.T, smoothed state distributions, p(xt\yi-.T), t <T, can be calculated 
from filtered and predictive distributions by recursing backwards (Anderson 
and Moore 1979). Instead of the true filtered and predictive distributions, 
however, we now have the approximated filtered and predictive distributions 
computed by the LGF. By using these approximated distributions, the approx- 
imated smoothed distributions can be obtained as 

(■/ 1 \ ~ft \ f P( x t+i\yi:T)p{xt+i\x t ) . . 

P(Xt\yi:T) = P(Xt\yi:t) / ~, j ; dx t +l. (5) 

J P{x t+1 \y 1:t ) 
We address the accuracy of LGF smoothing in Theorem [5j 



2.3 Implementation 

Two aspects of the numerical implementation of the LGF call for special com- 
ment: maximizing the likelihood and computing its second derivatives. One 
key point is that the Hessian in Eq. may be computed by careful numeri- 
cal differentiation. Avoiding analytical derivatives saves substantial time when 
fitting many alternative models. See Appendix iBl for a brief description of our 
numerical procedure for computing the Hessian matrix, and Kass (1987) for full 
details. 

The log-likelihood function can be maximized by an iterative algorithm (e.g. 
Newton's method), in which Xt\t-i and x t \ t would be chosen as a reasonable 
starting point for maximizing l(xt) and k(xt), respectively. The convergence 
criterion also deserves some care. Writing xW for the value attained on the i th 
step of the iteration, the iteration should be stopped when — a;W|| < 

C7~ Q , where c is a constant and 7 is the expansion parameter, to be discussed 
in section 13.11 and a is the order of the Laplace approximation. The value of c 
should be smaller than that of 7 (c = 1 is a reasonable choice in practice) . 



3 Theoretical Results 

For simplicity, we state the results for the one dimensional case. The exten- 
sion to the multi-dimensional case is notationally somewhat cumbersome but 
conceptually straightforward. Let p and p denote the true density of a random 
variable and its approximation, and let h(xt) be 

h(x t ) = -- \ogp(yt\xt)p(x t \yi:t-i), (6) 
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where 7 is the expansion parameter, whose meaning will be explained later in 
this section. 



3.1 Regularity conditions 

The following properties are the regularity conditions that are sufficient for the 
validity of Laplace's method (Erdelyi f956; Kass, Tierney and Kadane 1990; 
Wojdylo 2006). 

(C.l) h(xt) is a constant-order function of 7 as 7 — > 00, and is five-times 
differcntiablc with respect to x t . 

(C.2) h(xt) has an unique interior minimum, and its second derivative is posi- 
tive (the Hessian matrix is positive definite for multi-dimensional cases) 

(C.3) p(xt+i\xt) is four-times differcntiablc with respect to Xt- 

(C.4) The integral 



exists and is finite. 

We also assume the following condition which prohibits ill-behaved, "explosive" 
trajectories in state space: 

(C.5) Derivatives of h(xt) up to fifth-order and those of p{xt+\ \xt) with respect 
to xt up to third-order are bounded uniformly across time. 

Strictly speaking, h(xt) is a random variable, taking values in the space of 
intcgrable non-negative functions of M. This random variable is measurable with 
respect to cr(j/i : t). Therefore, the stated regularity conditions only need to hold 
with probability 1 under the distribution of yut (Kass et al. 1990). 

In the following section we will state the theorems that ensure that, under 
conditions (C.1)-(C.5), the LGF does not accumulate error over time, but first 
we explain the meaning of the expansion parameter. 

Meaning of 7 As seen in Eq. (O and the regularity condition (C.l), for a 
given state-space model, 7 is constructed by combing the model parameters so 
that the log posterior density is scaled by 7 as 7 —> 00. In general, 7 would be 
interpreted in terms of sample size, the concentration of the observation density, 
and the inverse of the noise in the state dynamics; we will describe how 7 is 
chosen for a neural decoding model in section 3J From the construction of 7, 
the second derivative of the log posterior density, which determines the concen- 
tration of the posterior density, is also scaled by 7. Therefore, the larger 7 is, 
the more precisely variables can be estimated, and the more accurate Laplace's 
method becomes. When the concentration of the posterior density is not uni- 
form across state-dimentions in a multidimensional case, a multidimensional 7 
could be taken. Without a loss of approximation accuracy, however, a sim- 
ple implementation for this case is taking the largest 7 as a single expansion 
parameter. 
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3.2 Main theoretical results 



Theorem 1 (accuracy of predictive distributions) Under the regularity con- 
ditions (C.1)-(C4), the a-order LGF approximates the predictive distribution 
as 

p{x t \yi:t-i) = p(x t \yv.t-i) + 0(7^), 

for i 6 N, where ft = 1 for a — 1 and ft = 2 for a > 2. Furthermore, if condition 
(C.5) holds, the error term is bounded uniformly across time. 

The error bound can also be established for the posterior (filtered) expecta- 
tions in the following theorem. 

Theorem 2 (accuracy of posterior expectations) Under the regularity con- 
ditions (C.1)-(C4), the a-order LGF approximates the filtered conditional ex- 
pectation of a four-times differ entiable function g(x), 

E[g(x t )\y ut ] = E\g(x t )\yut] + 0(7"*), 

for t G N, with ft as in Theorem]^ Here E[-\y\-t\ and E[-\y\-t\ denote the 
expectation with respect to p(xt\y\-.t) o.nd p{xt\yi:t), respectively. Furthermore, 
if condition (C.5) holds, the error term is bounded uniformly across time. 

Note that the order of the error is 7 -2 even for a > 2 both in Theorem [T] and 
Theorem[2j That is, even if higher than the second-order Laplace approximation 
in Step 3 of the LGF is employed, the resulting approximation error does not 
go beyond the second-order accuracy with respect to This fact leads to 

the following corollary. 

Corollary 3 The second-order approximation is the best achievable for the LGF 
scheme. 

The following theorem refers to stability of the LGF. It states that minor 
differences in the initially-guessed distribution of the state tend to be reduced, 
rather than amplified, by conditioning on further observations, even under the 
Laplace's approximation. 

Theorem 4 (stability of the algorithm) Suppose that two approximated pre- 
dictive distributions at time t satisfy 

p 1 {x t \y 1 ..t-i) - fa(x t \yut-i) = 0{j~ u ), 

where v > 0. Then, under the regularity conditions (C.1)-(C.4), applying the 
LGF u(> 0) times leads to the difference of two approximated predictive distri- 
butions at time t + u as 

Pl{x t +u\yi:t+u-l) ~ P2{x t +u\yi:t+u-l) = 0(-f~ U ~ U ). 
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Theorem 5 Under the regularity conditions (C.1)-(C4), the expectation of a 
four-times differ entiable function g(x) with respect to the approximated smoothed 
distribution Eq. is given by 

E[g(x t )\ yi ., T } = E[g(x t )\y 1:T ] + 0(^), 

for t = 1,2, ... ,T, with (3 as in Theorem^ Furthermore, if condition ( C.5) is 
satisfied, the error term is bounded uniformly across time. 

3.3 Computational cost 

Assuming that the maximization of l(xt) and k{x t ) is done by Newton's method, 
the time complexity of the LGF goes as follows. Let d be the number of di- 
mensions of the state, T the number of time steps, and N be the sample size. 
The bottleneck of the computational cost in the first-order LGF comes from 
maximization of l(xt) at each time t. In each iteration of Newton's method, 
evaluation of the Hessian matrix of l(xt) typically costs 0(Nd 2 ), as d 2 is the 
time complexity for matrix manipulation. Over T time steps, the time com- 
plexity of the first-order LGF is 0(TNd 2 ). In the second-order LGF, the time 
complexity of calculating the posterior expectation of each xt,i is still 0{Nd 2 ), 
but calculating it for i = l,...,d results in 0(Nd 3 ). Repeating over T time 
steps, the complexity of the second-order LGF is 0(TNd 3 ). 

For comparison, take the time complexity of a particle filter (PF) with M 
particles. It is not hard to check that the computational cost across time step 
T of the particle filter is 0(TMNd). For the computational cost of the particle 
filter to be comparable with an LGF, the number of particles should be M = 
0(d) for the first-order LGF and M = 0(d 2 ) for the second-order LGF. 

4 Application to neural decoding 

The problem of neural decoding consists in using an organism's neural activ- 
ity to draw inferences about the organism's environment and its interaction 
therewith — sensory stimuli, bodily states, motor behaviors, etc. (Rieke, War- 
land, de Ruyter van Steveninck Rob and Bialek 1997). Scientifically, neural 
decoding is vital to studying neural information processing, as reflected by the 
many proposed decoding techniques (Dayan and Abbott 2001). Its engineer- 
ing importance comes from efforts to design brain-machine interface devices, 
especially neural motor prostheses (Schwartz 2004) such as computer cursors, 
robotic arms, etc. The brain-machine interface devices must determine, from 
real-time neural recordings, what motion the user desires the prosthesis to have. 
Such considerations have led to many proposals, emanating from Brown et al. 
(1998), for neural decoding based on state-space models (Brockwell ct al. 2007). 

In the rest of this section, we introduce a standard model setup for neural 
decoding tasks, and identify its Laplace expansion 7. We then simulate the 
model and apply the LGF, confirming the applicability of our theoretical results, 
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and comparing its performance to particle filtering. Finally, we apply the model 
and the LGF to experimental data. 

4.1 Model setup 

We consider the problem of decoding a "state process" from the firing of an en- 
semble of neurons. Here we suppose that neurons respond to a x t £ M. d , where d 
is the number of dimensions. x t may be interpreted as two- or three-dimensional 
hand kinematics for motor cortical decoding (Georgopoulos, Schwartz and Kettner 
1986; Ketter, Schwartz and Georgopoulos 1988; Paninski, Fellows, Hatsopoulos 
and Donoghue 2004), or hand posture (about 20 degrees of freedom) for dexter- 
ous grasping control (Artemiadis, Shakhnarovich, Vargas-Irwin, Donoghue and 
Black 2007). We consider N such neurons, and assume that the logarithm of the 
firing rate of neuron i is (Truccolo, Eden, Fellows, Donoghue and Brown 2005) 

logAi(xi) = oci +0i-x t ■ (7) 

We let yi t t be the spike count of neuron i at time-step t. We assume that y^^ 
has a Poisson distribution with intensity Xi(x t )A, where A is the duration of 
the short time-intervals over which spikes are counted at each time step. We 
also assume that firing of neurons is conditionally independent with each other 
given x t . Thus the probability distribution of y t , the vector of all the y^t, is 
the product of the firing probabilities of each neuron. We assume that the state 
model is given by 

x t = Fx t -i + e t , (8) 

where F S R dxd and e t is a d-dimensional Gaussian random variable with mean 
zero and covariance matrix W G R dxd . 

The expansion parameter 7 for this model is identified as follows. The second 
derivative of l(x t ) = \ogp(y t \xt)p{x t \yi;t-i) is 

N 

l"(x t ) = -A^ f3 lC xp(a t + j3 t ■ x t )f)J - V^, 

i=l 

where Vt|t_i is the covariance matrix of the predictive distribution at time t. 
Then, from the Cauchy-Schwarz inequality, 

N 

\\l"{x t )\\ < A^Tcxp^ +/3i ■ z t )||AH 2 + WV^l 

i=l 

Since HV^j^JI is scaled by ||VF _1 ||, we can identify the expansion parameter: 

N 

T^A^e^lAf + HW- 1 !!. (9) 

We see that 7 combines the number and the mean firing rate of the neurons, 
the sharpness of neuronal tuning curves and the noise in the state dynamics. 
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Given our assumptions, the observation model p{y t \x t ) and the state tran- 
sition density p(xt+i\xt) are strictly log-concave and have an unique interior 
maximum in x t , and their derivatives up to fifth-order arc uniformly bounded 
if the state is bounded. Furthermore, h(xt) is a constant-order function of 7 
as 7 — > 00, which can be seen from the construction of 7. Thus, the regularity 
conditions (C.1)-(C.5) arc satisfied if the initial distribution satisfies them. 

In what follows, we took the initial value for filtering to be the true state 
at t = 0. Note that when there is no information about the initial distribution, 
we could use a "diffuse" prior density whose covariance is taken to be large 
(Durbin and Koopman 2001). Either type of initial condition would satisfy the 
regularity conditions. We can thus construct LGFs according to section [5] 

4.2 Simulation study 

We performed numerical simulations to study first and second-order LGF (la- 
beled by LGF-1 and LGF-2, respectively) approximations under conditions rel- 
evant to the neural decoding problems we are working on. We also compared 
LGF to particle filtering. We judged performance by accuracy in computing 
the posterior mean (which was determined by particle filtering with a very large 
number of particles). However, the posterior mean contains statistical inaccu- 
racy (due to limited data). We also evaluated the accuracy with which the 
several alternative methods approximate the underlying true state. 

Simulation Setup In each simulation run, we generated a state trajectory 
from a d-dimensional AR process, Eq. flHJ), with F = 0.947 and W = 0.0197, I 
being the identity matrix, over T — 30 time-steps of duration A = 0.03 seconds. 
We examined different number of state-dimensions, d = 6, 10, 20, 30. Regardless 
of d, we observed neural activity due to the state through TV = 100 neurons, 
with a* = 2.5 + Af(0, 1) and (3i uniformly distributed on the unit sphere in R d . 
Finally, the spike counts were drawn from Poisson distributions with the firing 
rates \ {x t ) given by Eq. ([7]) above. 

Methods To compare LGF state estimates to the posterior mean we first 
needed a high-accuracy evaluation of the posterior mean itself. We obtained this 
by averaging results from ten independent realizations of particle filtering with 
10 6 particles; the resulting approximation error in the mean integrated squared 
error (MISE) is O(10~ 7 ), and so negligible for our purposes. We also applied 
the particle filter (PF) for comparison. The number of particles in the PF was 
chosen by consideration of computational cost; as discussed in subsection 13.31 
a LGF-1 is comparable in time complexity to a PF with 0(d) particles, and a 
LGF-2 is comparable to a PF with 0(d 2 ) particles. For the case of d < 30, 100 
particles (PF-100) was about the least number at which the PF was effective 
and was not much more resource-intensive than the LGF-1. In order that the 
computational time of a PF matchs that of the LGF-2, we chose 100, 300, 500 
and 1000 particles for d = 6, 10, 20 and 30, respectively. (We label it PF-scaled.) 
See also Table 
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We implemented all the algorithms in Matlab, and we ran them on Windows 
computer with Pentium 4 CPU, 3.80GHz and 3.50GB of RAM. 

Results The first four rows in Table Q] show the four filters' MISE in ap- 
proximating the actual posterior mean. LGF-2 gives the best approximation, 
followed by LGF-1; both arc better than PF-100 and PF-scalcd. Note that 
LGF-1 is much faster than PF-100, and the computational time of LGF-2 is 
approximately the same as that of PF-scaled (Table [2]). Figure Q] displays the 
MISE of particle filters in approximating the actual posterior mean as a function 
of the number of particles, for d = 6. PF needs on the order of 10 4 particles 
to be as accurate as LGF-1, and about 10 6 particles to match LGF-2. Further- 
more, since the computational time of the PF is proportional to the number 
of particles, the time needed to decode by PF with 10 4 and 10 6 particles are 
expected to be about 20s and 2,000s, respectively (from Table [2]). Thus, if we 
allow the LGFs and the PF to have the same accuracy, LGF-1 is about 1,000 
times faster than the PF, and LGF-2 is expected to be about 10, 000 times faster 
than the PF. 

Table 1: MISEs for different filters 



Method Number of dimensions, d 





6 


10 


20 


30 


LGF-2 


0.0000008 


0.000002 


0.00001 


0.00006 


LGF-1 


0.00003 


0.00004 


0.0001 


0.0002 


PF-100 


0.006 


0.01 


0.03 


0.04 


PF-scaled 


0.006 


0.007 


0.01 


0.02 


posterior 


0.03 


0.04 


0.06 


0.07 



NOTE: The first four rows give the discrepancy between four approximate filters 
and the optimal filter (approximation error). The fifth row gives the MISE 
between the true state and the estimate of the optimal filter, i.e., the actual 
posterior mean (statistical error). All values arc means from 10 independent 
replicates. The simulation standard errors are all smaller than the leading digits 
in the table. 

The value of 7 for this state-space model is 7 « 100 (Eq. ([9])). From Theorem 
[3 the MISEs of LGF-1 and LGF-2 are, respectively, evaluated as c\ 7 -2 and 
C27 -4 , where C\ and C2 are constants depending on the model parameters. If c\ 
and c 2 were in the range 1 to 10, then the MISEs of LGF-1 and LGF-2 should 
be 10 -4 to 10 -6 , roughly matching the simulation results. 

The fifth row of Table Q] shows the MISE between the true state and the 
actual posterior mean. The error in using the optimal filter, i.e., the actual 
posterior mean, to estimate the true state is statistical error, inherent in the 
system's stochastic characteristics, and not due to the approximations. The 
statistical error is an order of magnitude larger than the approximation error 



11 



Table 2: Time (seconds) needed to decode 



Method Number of dimensions, d 





6 


10 


20 


30 


LGF-2 


0.24 


0.43 


1.0 


2.0 


LGF-1 


0.018 


0.024 


0.032 


0.056 


PF-100 


0.18 


0.18 


0.18 


0.19 


PF-scaled 


0.18 


0.50 


0.81 


1.8 



NOTE: All values are means from 10 independent replicates. The simulation 
standard errors are all smaller than the leading digits in the table. 




Number of particles 



Figure 1: Scaling of the MISE for particle filters. The solid line represents the 
MISE (vertical axis) of the particle filter as a function of the number of particles 
(horizontal axis). Error here is with respect to the actual posterior expectation 
(optimal filter). The dashed and dotted horizontal lines represent the MISEs 
for the first- and second-order LGF, respectively. 
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in the LGFs, so that increasing the accuracy with which the posterior expec- 
tation is approximated does little to improve the estimation of the state. The 
approximation error in the PFs, however, becomes on the same order as the 
statistical error when the state dimension is larger (d = 20 or 30). In such cases 
the inaccuracy of the PF will produce comparatively inaccurate estimates of the 
true state. 

Finally, we examined how the choice of initial prior density affects the fil- 
tering result. Figure [5] shows five estimated trajectories started with different 
initial values. These five trajectories converged to the same state as the time 
evolves, as expected from Theorem UJ 

0.8 
0.6 
0.4 
o> 0.2 
05 
-0.2 
-0.4 
-0.6 

5 10 15 20 25 30 

Time 

Figure 2: The solid lines represent the estimated trajectories with five different 
initial values by LGF-1. The dashed line represents the true state trajectory. 



4.3 Real data analysis 

Experiment setting and data collection We used LGF to estimate the 
hand motion from neural activity. A multi-electrode array was implanted in the 
motor cortex of a monkey to record neural activity following procedures similar 
to those described previously in Velliste, Perel, Spalding, Whitford and Schwartz 
(2008). In all, 78 distinct neurons were recorded simultaneously. Raw voltage 
waveforms were thrcsholdcd and spikes were sorted to isolate the activity of 
individual cells. A monkey in this experiment was presented with a virtual 3-D 
space, containing a cursor which was controlled by the subject's hand position, 
and eight possible targets which were located on the corners of a cube. The task 
was to move the cursor to a highlighted target from the middle of the cube; the 
monkey received a reward upon successful completion. In our data each trial 
consisted of time series of spike-counts from the recorded neurons, along with 
the recorded hand positions, and hand velocities found by taking differences 
in hand position at successive A = 0.03s intervals. Each trial contained 23 
time-steps on average. Our data set consisted of 104 such trials. 
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Methods For decoding, we used the same state-space model as in our simu- 
lation study. Many neurons in the motor cortex fire preferentially in response 
to the velocity v t € M 3 and the position z t G K 3 of the hand (Wang, Chan, 
Heldman and Moran 2007). We thus took the state x t to be a 6-dimensional 
concatenated vector x t = (z t ,v t ). The state model was taken to be 



where e t is a 3-D Gaussian random variable with mean zero and covariancc 
matrix cr 2 J, J being the identity matrix. 16 trials consisting of 2 presentations 
of each of the 8 targets, were reserved for estimating the parameters of the 
model. The parameters in the firing rate, on and /3;, were estimated by Poisson 
regression of spike counts on cursor position and velocity, and the value of 
a 2 was determined via maximum likelihood. The time-lag between the hand 
movement and each neural activity was also estimated from the same training 
data. This was done by fitting a model over different values of time-lag ranging 
from to 3As. The estimated optimal time-lag was the value at which the 
model had the highest R 2 . Having estimated all the parameters, cursor motions 
were reconstructed from spike trains for the other 88 trials, and it is on these 
trials we focused. For comparison, we also reconstructed the cursor motion 
with a PF-100 and a widely-used population vector algorithm (PVA) (Dayan 
and Abbott 2001, pp. 97-101) (see also Appendix [D} . 

Results Figure [3] compares MISEs for different algorithms in estimating the 
true cursor position. Figure [3] (a) compares the MISE of LGF-1 with that of 
LGF-2. Just like in the simulation study, there is no substantial difference 
between them since the statistical error is larger than the LGFs' approximation 
errors. Figure G2 (b) compares LGF-1 to PF-100: the former estimates the true 
cursor position better than the latter in most trials. Also (Table [2]), LGF-1 is 
much faster than PF-100. Figure [3] (c) shows that the numerical error in the 
PF-100 is of the same order as the error resulting from using PVA. (Plots of the 
true and reconstructed cursor trajectories are shown in Appendix [EJ ) 

5 Discussion 

In this paper we have shown that, under suitable regularity conditions, the 
error of the LGF does not accumulate across time. In the context of a neural 
decoding example we found the LGF to be much more accurate than the particle 
filter with the same computational cost: in our simulation study the first-order 
and second-order LGFs had MISE of about 1/200 to 1/7,500 the size of the 
particle filter. We also found that for 6-dimcnsional case, about 10,000 particles 
were required in order for the particle filtering to become competitive with the 
first-order LGF; and the second-order LGF remained as accurate as the particle 
filter with 1,000,000 particles. In many situations (such as some neural decoding 




(10) 
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(a) 



/ 



0.5 1 
LGF2 




0.5 1 1.5 2 2.5 
PF100 „,„-a 



Figure 3: Algorithm comparisons. The horizontal and vertical axes represent 
the MISE of different algorithms in estimating the true cursor position. Each 
point compares two different algorithms for a trial. Overall, 4 algorithms (LGF- 
1, LGF-2, PF-100 and PVA) were compared for 88 trials, (a) LGF-2 vs. LGF-1, 
(b) LGF-1 vs. PF-100, and (c) PF-100 vs. PVA. 



applications), implementation needs to be easy so that repeated refinements in 
modeling assumptions may be carried out quickly. With this in mind, it might 
be argued that the simplicity of the particle filter gives it some advantages. 
We have, however, noted how numerical methods may be used to supply the 
necessary second-derivative matrices (see Appendix iBl and Kass (1987)), and 
these, together with maximization algorithms, make it as easy to modify the 
LGF for new variations on models as it is to modify the particle filter. Nor 
does the use of the LGF interfere with diagnostic tests and model-adequacy 
checks, such as the time-rescaling theorem for point processes (Brown, Barbieri, 
Ventura, Kass and Frank 2002). The obvious conclusion is that the LGF is 
likely to be preferable to the particle filter in applications where the posterior 
in Eq. (p} becomes concentrated. 

We should note that the validity of the LGF is guaranteed only when the 
posterior distribution is uni-modal and has a log-concave property. On the 
other hand, the particle filter is a distribution-free method and can be used in 
a multi-modal case. 

It is perhaps worth emphasizing the distinction between the LGF and other 
alternatives to the Kalman filter. The simplest non-linear filter, the extended 
Kalman filter (EKF) (Ahmed 1998), linearizes the state dynamics and the ob- 
servation function around the current state estimate x, assuming Gaussian dis- 
tributions for both. The error thus depends on the strength of the quadratic 
nonlinearities and the accuracy of preceding estimates, and so error can accu- 
mulate dramatically. The LGF makes no linear approximations — every filtering 
step is a (generally simple) nonlinear optimization — nor does it need to approx- 
imate cither the state dynamics or the observation noise as Gaussians. 

In our simulation studies, the second-order LGF was always more (in some 
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cases much more) than 20 times as accurate as the first-order LGF in approx- 
imating the posterior, but this translated into only small gains in decoding 
accuracy. The reason is simply that the inherent statistical error of the poste- 
rior itself was much larger than the numerical error of the first-order LGF in 
approximating the posterior. We would expect this to be the case quite gener- 
ally. Thus, our work may be seen as supporting the use of the first-order LGF, 
as applied to neural decoding in Brown et al. (1998). 

Finally, an interesting idea is to use a sequential approximation to the 
posterior based on some well-behaved and low-dimensional parametric family, 
and to apply sequential simulation based on that family. The Gaussian could 
again be used (e.g., (Azimi-Sadjadi and Krishnaprasad 2005; Brigo, Hanzon and 
LeGland 1995; Ergiin et al. 2007)), and our results would provide new theoret- 
ical justification for such procedures. However, it is well-known that Gaussian 
distributions, with their very thin tails, are poorly suited for importance sam- 
pling, so that heavier-tailed alternatives often work better (e.g., (Evans and 
Swartz 1995)). Sequential simulation schemes with approximating Gaussians 
replaced by multivariate t. or other heavy-tailed distributions, may be worth 
exploring. 
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APPENDIX A. Proofs of theorems 



We begin by proving a lemma and a proposition needed for the main theorems. 
To simplify notation we introduce the symbols h[ 1 ^ = d l h{x t ) / dx\\ Xt=Xt]t and 
q®(x t +i) = d l p(x t+ i\x t )/dx l t \ Xt = Xtlt . 

Lemma 6 Let h(xt) be 

H x t) = -~logp(y t \x t )p(xt\yi :t -i) , (11) 
7 

hf = d Xt h(x t \t), and i t \t the minimizer of h(xt). Then, under the regular- 
ity conditions, the order-a Laplace approximation of the posterior mean and 
variance have series expansions as 



a-l 



and 



x tl t = J2MW ) }h- j , (12) 

3=0 

a-l 

v t \t = J2 B ^' h t l) }h~ j > ( 13 ) 

3=1 

where the coefficients, Aj andBj, are functions of {h[^}. 

Proof (Lemma [6|) The expectation of a function g(x t ) with respect to the 
approximated posterior distribution is 

E[ 9 ^)\m,) - S9M^hM ]dxt (14) 

J exp [-^h{xt)\ax t 

where g(x t ) = x t for the mean and g(x t ) = x\ for the second moment. We get 
the coefhcients Aj and Bj by applying Laplace's method, an (infinite) asymp- 
totic expansion of a Laplace- type integral (Theorem 1.1 in (Wojdylo 2006); see 
Appendix [Cl for a brief summary), to both the numerator and the denominator 
of Eq. (|14|): those formulae also show that the coefhcients are functions of {Kp }, 

I = 1,2, For example, the coefficients of up to first-order terms are obtained 

as A ({hl l) }) = x t{u A 1 ({h[ l) }) = -h'l'/{2{hl{)\ and B^h®}) = (h'/)-\ □ 

Remark 1 Lemma [6] guarantees that the choice of x t \t = x t \t and v t \t = 
(^h'l) -1 provides the first-order approximation of posterior mean and variance. 
As proved in Tierney et al. (1989), Eq. Q achieves the second-order expansion of 
the posterior mean x t \ t = x t \ t + Ai({h[ 1 ' })7 _1 - Thus Eq. Q and v t u = (^h") -1 
provide the second-order approximation. 
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Proposition 7 Suppose that the regularity conditions (C.1)-(C4) hold, and 
that the approximated predictive distribution of time t satisfies 



N 



p(x t |i/ lit -i) - P(x t \yi:t-i) + £ tA*t)l- ] + 0(j- N ~ 1 ), (15) 



3=v 



where £t,j(xt) "is a constant- order function 0/7 and < v < N for i/,N £ N. 
Replacing the filtered distribution at time t with a Gaussian with a-order Laplace 
approximated mean and variance leads to the approximate predictive distribution 
at time t + 1, 



N N 

N-l\ 



p(xt+i\yi:t)=p(x t +i\yi:t)+Yi£t+i,j( x t+ih 3 +^£t+i,i+i{x t +i)i 3 1 +0(7" 

3=P 3=v 

(16) 

where fj = 1 for a = 1 and (3 = 2 for a > 2. Here £ t * + i j(xt+i) does not depend 
on {£t,k(xt)}k=v,u+i,... and 



c 1 ^ a'{x t+ i) d ( £ t ,j{xt) 

c-t+l,j+l{X t +l) - 



h" dx t \p(x t \yi:t-i) 



+ 0(7" 1 ), (17) 

x t =x t , t 



for j = v,v + 1, . . . , N . Furthermore, if the condition (C.5) is satisfied, the 
coefficients of the expansion terms in Eq. 116\) are bounded uniformly across 
time. 

Proof (Proposition [7J The proof works by comparing the asymptotic ex- 
pansions of the true and approximated predictive distributions. To do this, we 
must find those asymptotic expansions; once this is done the remaining steps 
are fairly straightforward. 

(i) We begin by evaluating the true predictive distribution at time t + 1. 
From Eqs. ((J) and ©, this is 

/ p(x t+ i\x t )exp[-jh{x t )]dx t 

P{Xt+l\yi:t) = f r w m , ■ 

J exp [-jh(xt)\dx t 

Applying Laplace's method (Theorem 1.1 in (Wojdylo 2006), see also Appendix 
|C|) to both the numerator and the denominator of above equation leads to 

Ef.»r( s + |)(^)'<4,7- + 0(7-"- 1 ) 
P{X ' M = EL„r( s + i)^ m . 7 - + 0( 7 -»-V (18) 

where 
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and 



3=0 



s+1 
2 



(20) 



Here C SJ (Ai, . . .) is a partial ordinary Bell polynomial, which is the coefficient 

of x l in the formal expansion of {A\x + A-^x 1 + ■•■)■', and Ai = Ai({h[ 1 ^}) is 
the coefficient which appeared in Lemma [6] Expanding with respect to 7 — 1 , we 
obtain the asymptotic expansion of p(xt+i \yi-.t) as 

TV 

p(x t+1 \y 1:t ) = q(x t+1 ) + J2Ci(xt+ih- 3 +0( 1 ' N ~ 1 ) , (21) 

where q(xt+i) was earlier defined as p(xt+i\xt), and where Cj{x t J r \) depends 
on q^ k \x t +i) and np {k,l = 1,2,...). Cj{xt+\) is directly calculated by Eqs. 

(ii) We next consider the approximated predictive distribution of time t + 1, 



p(x t +i\yi:t) = p{xt+i\xt)p(xt\yi:t)dx t , 



(22) 



where p(xt\yi-.t) is the Gaussian distribution whose mean and variance are given 
by Eq. (fT2"]) and (fTU|) . respectively. Eq. (|22p can be re- written as 



p(x t+1 \y 1:t ) 



Applying Laplace's method again, 



p(x t+ i|x t )exp 



(x t - x t \tf 



2v t \- 



dx t 



1 g<*W) r j 



p(x t+1 \y 1:t ) = q(x t+1 ) + J2 l^+lfa + Oft/" 1 ), 



(23) 



where T(J + 1) is the Gamma function and 

-(!) _ d l p{x t+ i\x t ) 
It — 



dx t 



(24) 



Now we compare Eqs. (|23p and (|2"Tj) . via a series of substitutions. We want 
to re-write Eq. ([23]) with q^(x t+ i) and hf } . Substituting Eq. ([15]) into Eq. 
([Ill, 



h(x t ) 



1 



\ogp(y t \x t ) 



N 



p(*t|yi: t -i) + + 0( 7 - iV " 1 ) 



/v 



(25) 



19 



where 



0( 7 - 



p(xt\yv.t-i) 

is a collection of terms which depend on £ t j(xt). 

Suppose x t i t = x t \ t + e and eCl. Taking the derivative both sides of Eq. 
((2"5|) and evaluating it at x t \ t , we obtain 



JV T i 



-7V-2\ 



Then we get 



\ -> j t, 



I 7 -i-l +0 ( 7 -^-8). 



(26) 



Inserting Eq. (|2"6"|) into Eq. ([23| gives 



N 

E 



_ •>_ t,j'H 



7 -J-i + ( 7 -«-a) 



(27) 



(28) 



Substituting Eq. J26) and Eq. (J2TJ) into Eq. ([T5J leads to 

a-l N -pi 

3=1 * 
Inserting Eq. (|28|) into Eq. (|24| and expanding with respect to 

^Wo = q v\x t+1 ) + Y^M (l+1 Hx t+ ih- 3 +ir\i2 y il+k) {x t+1 )c^ k {Ai' 



3=1 



j=2 L fc=2 



iV x-/ 

+ E^? (i+1) (^)7- i - 1 +0(7- a - 1 ). 
j=f * 

Substituting Eqs. ([I3]l . p7]l and (J29J) into Eq. ([23]t . we obtain the final 
asymptotic expansion of p(xt+i\yut), 

p(x t +i\yut) = (30) 

Q N-l -pi 

q (x t+1 ) + J2Rj(xt+ih- j + Yl ^9 , (x t+ ih- J ~ 1 + o( 7 ^- 1 ) , 

3=1 i=v 1 



in which 



R , \ = ( G j( x t+i) + Ajq'(x t+1 ) l<j<a-l 



20 



and 



Gj(x t 



= l,^C J , s (A 1 ,...)g u (x t+1 ) + ^ 2sr(g + 1) 



EE 

j-2j-2j-k 

EEE 

s— 1 /c— s n—2 



A J -_ fc C fc , a (fli,...) g ( 2fl+1 >(a; t+ i) 

2 s r(s + 1) 

Cj-fc.nCAi, . . .)C fc|8 (Bi, . ■ .)g( 2s+ ")(x t+1 ) 
2 s r(s + l)n! 



where Bj = Bj({h[ }) appeared in Lemma [6l 

(iii) Now we compare Eqs. (f2~T|) and (|30p . The coefhcients, up to second 
order terms, in the former are 



Ci(x t+ i) = 



q"(x t+1 ) K'q'(x t+1 ) 



2ti; 



my 



(31) 



and 

C 2 (x t+ i] 



q^(x t+1 ) 5h' t "q"'(x t+1 ) 



8(h>r 



12(h>'r 



2h'('h t 



(4) 



Hh't") 3 



,(5) 



my $m 5 8(^') 3 



5(M") 2 ^ 
8(^') 4 4(^') 3 

?'(zt+i). 



?"(zt+i) 



(32) 



For the first-order Laplace approximation (a = 1), the coefficient of order 



7- 1 in Eq. is 



Rxixt+i) = 



q"{xt+i) 
2h'{ ' 



which does not correspond to Ci(x t +i), and hence Eq. (fTl)j) holds. 
For a > 2, Rx(x t +\) is as 



Ri(x t 



+1, 



q"(x t+1 ) K'q'{x t+1 ) 



(33) 



(34) 



which corresponds to Ci(xt+i), and the first-order error term in Eq. (1161) is 
canceled. 

The second-order error term in Eq. (|30[) is calculated as 



#2(^+1) = 
for a = 2, and 

R2(x t +i) 



g( 4 Wi) h'l'q'"{x t+1 ) 

mi) 2 mi? 

q^{x t+1 ) Kq'"(x t+l ) 



b{h'l') 2 h\ 



(4) -1 



miY mi? 

5(h>n 2 h 



q"(x t+1 ), (35) 



mi? 



4(M') 3 



(4) 



(4) 



5(^") 3 



A* (6) 



3(h<r mif m/r 



8(h<r mif 

q'(x t+ i), 



q"(xt+i) 



(36) 



21 



for a > 3. Thus ^(aJt+i) ^ ^(xt+i) and second-order error term in Eq. f| 16[) 
remains for a > 2. 

From (j3"Tj) ~ (l3l)|) . the leading error term introduced by the Gaussian approx- 
imation is 



^+1,1(^+1) = R i( x t+i) ~ Ci(a; t+ i) 



KW(x t+1 ) 

my 



for a = 1, and 



6(/4') 3 



2^;"fe| 4) 

3(^') 4 



5(/i^) 3 
8(^') 5 



for a = 2, and 



£t*+i,2 0t+i) = #2(2^+1) - c 2 {x t +i) 



Ki'"{xt + i) 

6(h'{) 3 



q'(x t +i) 



for a > 3. Thus if the condition (C.5) is satisfied, the leading error term is 
bounded uniformly across time. We can confirm in the same way that the other 
error terms are also bounded uniformly. □ 

There are two sources of error in Eq. (fTB|) : first, that due to the replacement 



of the true filtered distribution at time t by a Gaussian, ^2f = a £t+i j( x t+i)l 



j=/3 "t+l,j* 



-J-l 



and, second, that due to propagation from time t, &t+i {xt+i)l 

At each step, the Gaussian approximation introduces an 0(7"^) error into 
the predictive distribution, where (3 = 1 for a = 1 and /3 = 2 for a > 2. 
However, the errors propagated from the previous time-step "move up" one 
order of magnitude (power of 7). Applying Eq. (|17l) repeatedly, we find that the 
leading error term, £^(xt)j~^ , which is generated at time-step t, is propagated, 
by a strictly later time-step u, to be £ UiU -t+i3{x u )^^ u ^ t+ ^ where 



£u,u— {Xu) 



9'Ou) n 

k=t+l 

I d 

x 



1 d / q'(x k ) 

h'l dx k \p(x k \yi:k-i) 



h'l dx t \p(x t \yi:t-i) 



The compounded error in time-step u is then given by the summation of the 
propagated errors from t — 1 to u — 1 as 



u-1 



t=l 



where the inequality holds under the condition (C.5), C < 7 is a constant which 
is independent of time t. The right hand side in this equation converge on 
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0(7 _ ^ _1 ) as u — ► oo, so that the compounded error after infinite time-step 
remains 0(7 _/3_1 ). The result is that the whole error term in the predictive 
distribution becomes 0(-f~"), even if it started out smaller, but it does not grow 
beyond that order. Thc-oremQ]is then proved from Proposition^ immediately: 

Proof (Theorem [TJ) The LGFs start with an initial predictive distribution 
which does not involve any errors. Thus, from Proposition [7] it is proved induc- 
tively that the error in the approximated predictive distribution is 0{^~^) and 
uniformly bounded for t € N. □ 

Proof (Theorem [2]) (Sketch) Since the predictive distribution, 
p(x t+ i\yut) = / p(xt+i\xt)p{xt\yi:t)dx t 



is the posterior expectation oip(x t +\ \xt) with respect to Xt, Thcorcm[2]is proved 
in the same way as Theorem [T] (replacing p(x t +i\x t ) by g(x t ) in the proof of 
Theorem [J). □ 

Proof (Theorem [4]) From Proposition the two predictive distributions at 
time t are given by 

N 

Mxt\yut-i)=p(x t \yi:t-i) + J2 £ $(xth- j + 0( 7 - iV - 1 ), 

3=" 

and 

JV 

M**\in*-i) = P(x t \yi..t-i) + Y l £%(xth- j + oij-"- 1 ), 

where Sfj{xt) ^= £^j(xt). Applying the LGF to both predictive distributions 

introduces the same errors at time t + 1, X)j =l g &t+x j( x t+i)l~ J 'i which are can- 
celed, while propagated errors from time-step t to t + 1 in both predictive dis- 
tributions, E^^t+ij+iC^+i)^"- 7 " 1 and Y,fjJ~ Zt+ij+i&t+ih'^ 1 are not 
canceled. Then we get pi(x t+ i\yut) - hi^t+ilyut) = O^"^ 1 ). Applying this 
procedure u times completes the theorem. □ 

Proof (Theorem [5]) Assume that the expectation at time t+1 satisfies 

E\g(x t+1 )\y lsT ] = E\g(x t+1 )\y 1:T ) + 0( 7 "^). (37) 

From Theorem [1] and Eq. (|37p . we obtain 

f p(x t+1 \y 1:T )p(xt + i\xt) f p(x t+ i\yi:T)p(xt+i\x t ) ,nt-P\ 
/ j \ dx t+i = / / i s dxt +1 +U(j ) . 

J P(X t +l\yi:t) J p(Xt+l\yi:t) 
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Using Theorem [21 the expectation at time t is 



fp\ ( \ I I [ f \"( \ \ [ P( X t+l\yi:T)p(xt+l\x t ) , , „, -,3s 

■E .g (a:*) yi:T = / 9(xt)p(x t \yi:t) / -, ; ; dx t+1 dx t + O 7 p ) 

J J P(Xt+l\yi:t) 

= / 9{x t )p{x t \yi :t ) / : ; r dxt + idx t + C(7 p ) 

= £?[ff(a: t )|yi:T] + 0(7^)- 

The initial smoothed distribution of the backward recursion is given by the fil- 
tered distribution p(xt\vi:t) , which satisfies E[xt\vi:t] = E[xt\di:t] + 0(j~P) 
by theorem [2] Then, the theorem is proved inductively. □ 

APPENDIX B. Numerical Computation for sec- 
ond derivatives 

We describe the numerical algorithm for computing the Hessian matrix, as 
promised in Section 12.31 

The Laplace approximation requires the second derivative (or the Hessian 
matrix) of the log-likelihood function evaluated at its maximum. However, it is 
often difficult, and even more often tedious, to get correct analytical derivatives 
of the log-likelihood function. In such cases accurate numerical computations of 
the derivative may be used, as follows. Consider calculating the second deriva- 
tive of l(x) at xo for the one-dimensional case. For n — 0,1,2,... and c > 1, 
define the second central difference quotient, 

A n , = [l(x + c- n h a ) + l(x Q - c~ n h ) - 2l(x Q )]/(c- n h a ) 2 , 

and then for k = 1, 2, . . . , n compute 

A n M = A n , k -1 + An ' k ^\ k+ ^ ' (38) 

When the value of \A n ^ — A n -i t k\ is sufficiently small, A n ,k+i is used for the 
second derivative. 

This algorithm is an iterated version of the second central difference formula, 
often called Richardson extrapolation, producing an approximation with an error 
of order 0(h 2< - k+1 1) (Dahlquist and Bjorck 1974). 

In the d-dimcnsional case of a second-derivative approximation at a max- 
imum, Kass (1987) proposed an efficient numerical routine which reduces the 
computation of the Hessian matrix to a series of one-dimensional second-derivative 
calculations. The trick is to apply the second-difference quotient to suitably- 
defined functions / of a single variable s as follows. 

1. Initialize the increment h = (hi, ■ ■ ■ , hd). 

2. Find the maximum of l(x), and call it x. 



24 



3. Get all unmixed second derivatives for each i = 1 to d, using the function 



X-i X-i -I- s 

Xj = Xj for j not equal to i 
f(s) = l(x(s)). 

Compute the second difference quotient; then repeat and extrapolate until 
the difference in successive approximations meets a relative error criterion, 
as in (f38|) ; store as diagonal elements of the Hessian matrix array, = 
/"(0). 

4. Similarly, get all the mixed second derivatives. For each i = 1 to d, for 
each j = i + 1 to d, using the function 



Xi + s/ , I 



Xj + s/Jl" 



3 — -J-'j -ro/y tjj 

Xk = Xk for k not equal to i or j 
/(«) = «(*(«))■ 

Compute the second difference quotient; then repeat and extrapolate until 
difference in successive approximations is less than relative error criterion 
as in (|3"5|) ; store as off-diagonal elements of the Hessian matrix array, 



i& = (/"(0)/2-l)^5&- 

In practice, the increment for computing the Hessian at time t would be 
taken to be hi — 0.1 x ^ v t\t—v * = 1> 2, . . . , d, where u^^j is the (i, i)-element 
of the covariance matrix of the predictive distribution at time t. 



APPENDIX C. Laplace's Method 

Here, we briefly describe Laplace's method, especially the details used in the 
proofs of Lemma 6 and Proposition 7. 
We consider the following integral, 

Hi) = J g{x)e-~< h ^dx, (39) 

where x 6 M; 7, the expansion parameter, is a large positive real number; 
h(x) and g{x) are independent of 7 (or very weakly dependent on 7); and the 
interval of integration can be finite or infinite. Laplace's method approximates 
1(7) as a series expansion in descending powers of 7. There is a computationally 
efficient method to compute the coefficients in this infinite asymptotic expansion 
(Theorem 1.1 in (Wojdylo 2006)). Suppose that h(x) has an interior minimum 
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at xo, and h(x) and g(x) are assumed to be expandable in a neighborhood of 
xq in series of ascending powers of x. Thus, as x —> xq, 

oo 

h(x) ~ h(x ) + ^a s (x - x ) s+2 , 

s=0 

and 

oc 

g( x ) ~ E^ _ ^o) 8 ' 

in which ao, 60 7^ 0. 

Let us introduce two dimcnsionless sets of quantities, A4 = ai/ao and Bi = 
bi/bo, as well as the constants a\ = l/af/ 2 and cq = bo/a^ 2 . Then the integral 
in 1391) can be asymptotically expanded as 

00 

I( 7 ) ~ c e-^xo)^ r(s+ ) a fc* 2sl - s -K 

s=0 Z 

where 

i=0 j=0 \ J / 

where Cij(Ax, . . .) is a partial ordinary Bell polynomial, the coefficient of x l in 
the formal expansion of (A\x + A2X 2 + ■ ■ ■ ) J . C,j(Ai, . . .) can be computed by 
the following recursive formula, 

i-l 

Cij(Ai,...)= ^ Ai- m Cm,i-i{A\, . . .) , 

m=j — I 

for 1 > j > i. Note that C ofi (A 1 ,. . .) = 1, and C ii0 (^i, • • •) = Coj^i, • • •) = 
for all i,j > 0. 



APPENDIX D. The Population Vector Algorithm 

The population vector algorithm (PVA) is a standard method for neural de- 
coding, especially for directionally-sensitive neurons like the motor-cortical cells 
recorded from in the experiments we analyze (Dayan and Abbott 2001, pp. 
97-101). Briefly, the idea is that each neuron i, 1 < i < N, has a preferred mo- 
tion vector 6i, and the expected spiking rate \ varies with the inner product 
between this vector and the actual motion vector x(t), 

M|p = x[t ) . Qi , (40) 

where is a baseline firing rate for neuron i, and Aj a maximum firing rate. 
( (|40[) corresponds to a cosine tuning curve.) If one observes yi(t), the actual 
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neuronal counts over some time-window A, then averaging over neurons and 
inverting gives the population vector 



which the PVA uses as an estimate of x(t). If preferred vectors Oi are uniformly 
distributed, then x pop converges on a vector parallel to x as N —> oo, and is 
in that sense unbiased (Dayan and Abbott 2001, p. 101). If preferred vectors 
are not uniform, however, then in general the population vector gives a biased 
estimate. 



Figure [4] shows trajectories of the true and estimated (by LGF, PF-100 and 
PVA) cursor position of the real data analysis. It is seen that the LGF provides 
better estimation than either the PF-100 or the PVA. 




N 



APPENDIX E. Real data analysis 
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Figure 4: The trajectories of the cursor position. "True": actual trajectory. 
"LGF1" : trajectories estimated by first-order LGF, respectively. "PF100" : tra- 
jectory estimated by the particle filter with 100 particles. "PVA": trajectory 
estimated by the population vector algorithm. The trajectories estimated by 
the LGF2 are not shown; they are similar to those estimated by the LGF1. 
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